Data Ingestion using Amazon Kinesis Data Streams with S3 Data Lake
This topic describes how to create a data ingestion pipeline that uses Amazon Kinesis Data Streams as the data source, Databricks for data integration, and Amazon S3 as the data lake into which the data is ingested.
Prerequisites
- Access to a configured Amazon S3 instance, which will be used as the data lake in the pipeline.
- A configured instance of Amazon Kinesis Data Streams. For information about configuring Kinesis, refer to the following topic: Configuring Amazon Kinesis Data Streams
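Before building the pipeline, you can optionally confirm from outside Data Pipeline Studio that the stream is reachable and active. The following is a minimal sketch using boto3; the stream name and region are placeholders, so substitute your own values.

```python
# Optional sanity check: verify the Kinesis stream exists and is ACTIVE.
# "my-kinesis-stream" and "us-east-1" are placeholders for your own values.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")
summary = kinesis.describe_stream_summary(StreamName="my-kinesis-stream")
print(summary["StreamDescriptionSummary"]["StreamStatus"])  # expect "ACTIVE"
```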
Creating a data ingestion pipeline
- On the home page of Data Pipeline Studio, add the following stages and connect them as shown below:
- Data Source: Amazon Kinesis Data Streams
- Data Integration: Databricks
- Data Lake: Amazon S3
- Configure the Kinesis node and the Amazon S3 node.
- Click the Databricks node and click Create Job.
Complete the steps to create a data integration job.
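Data Pipeline Studio creates this job for you, but conceptually it resembles a Spark Structured Streaming read from Kinesis written to S3. The following is a minimal sketch of that pattern on Databricks, not the exact code the product generates; the stream name, region, and S3 paths are placeholders.

```python
# Conceptual sketch of a Kinesis-to-S3 streaming job on Databricks.
# Assumes the ambient `spark` session available in Databricks notebooks.
# Stream name, region, and bucket paths below are placeholders.
from pyspark.sql.functions import col

raw = (
    spark.readStream
    .format("kinesis")                        # Databricks Kinesis connector
    .option("streamName", "my-kinesis-stream")
    .option("region", "us-east-1")
    .option("initialPosition", "latest")
    .load()
)

# Kinesis payloads arrive as binary; cast to string before writing out.
events = raw.select(
    col("partitionKey"),
    col("data").cast("string").alias("payload"),
    col("approximateArrivalTimestamp"),
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://my-data-lake/checkpoints/kinesis")
    .start("s3://my-data-lake/raw/kinesis_events")
)
```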
Running the data ingestion pipeline
After you have created the data integration job with Amazon Kinesis Data Streams, ensure that you publish the pipeline. If you haven't already done so, click Publish. You can then run the pipeline in the following ways:
- Click the Data Streams icon. The Data Streams window opens, which provides a list of the data streams in the pipeline. Enable the toggle for the stream that you want to use to fetch data.
- Click the Databricks node and then click Start to run the data integration job. Navigate to Data Streams and click Refresh. Kinesis streaming is now enabled.
You can see that the data stream that you enabled is now running. Click the refresh icon to view the latest information about the number of events processed.
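To confirm that events flow end to end, you can push a test record into the stream and then refresh to watch the processed-event count increase. A minimal sketch using boto3; the stream name, payload, and partition key are placeholders.

```python
# Send one test record to the stream so the pipeline has data to process.
# Stream name, payload, and partition key are placeholders.
import boto3
import json

kinesis = boto3.client("kinesis", region_name="us-east-1")
kinesis.put_record(
    StreamName="my-kinesis-stream",
    Data=json.dumps({"event": "test", "value": 42}).encode("utf-8"),
    PartitionKey="test-key",
)
```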
Troubleshooting a failed data integration job
When you click the Databricks node in the pipeline, you can tell whether the data integration job has failed by checking its status.
- Click the Databricks node in the pipeline.
- Check the status of the Databricks integration job. The status can be one of the following:
  - Running
  - Canceled
  - Pending
  - Failed
- If the job status is Failed, click the ellipsis (...) and then click Open Databricks Dashboard.
- You are navigated to the specific Databricks job, which shows the list of job runs. Click the job run for which you want to view the details.
- View the details and check for errors.
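If you prefer to inspect failures programmatically rather than through the dashboard, the Databricks Jobs API can list recent runs and their result states. A minimal sketch against the REST API; the workspace URL, access token, and job ID are placeholders you would replace with your own values.

```python
# List recent runs for a Databricks job and print their states, so a
# failed run can be identified without opening the dashboard.
# Workspace URL, token, and job_id are placeholders.
import requests

host = "https://<your-workspace>.cloud.databricks.com"
token = "<personal-access-token>"

resp = requests.get(
    f"{host}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {token}"},
    params={"job_id": 123, "limit": 5},
)
for run in resp.json().get("runs", []):
    state = run["state"]
    print(run["run_id"], state.get("life_cycle_state"), state.get("result_state"))
```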
What's next? Data Ingestion using Amazon Kinesis Data Streams with Snowflake Data Lake